X-MimeOLE: Produced By Microsoft Exchange V6.5
Received: by onstor-exch02.onstor.net 
	id <01C8CD81.78F3ED10@onstor-exch02.onstor.net>; Fri, 13 Jun 2008 11:15:34 -0700
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Content-class: urn:content-classes:message
Subject: RE: Proposed design for new(ish) boot procedure for Cougar
Date: Fri, 13 Jun 2008 11:15:34 -0700
Message-ID: <BB375AF679D4A34E9CA8DFA650E2B04E03E9A8FB@onstor-exch02.onstor.net>
In-Reply-To: <BB375AF679D4A34E9CA8DFA650E2B04E0A6E8AE4@onstor-exch02.onstor.net>
X-MS-Has-Attach: 
X-MS-TNEF-Correlator: 
Thread-Topic: Proposed design for new(ish) boot procedure for Cougar
Thread-Index: AcjNBY7jAxHG5lQkSAitBSg9o+F/nAAbC2eQAANCx6AAAHaN0AAAF6oQ
From: "Chris Vandever" <chris.vandever@onstor.com>
To: "Maxim Kozlovsky" <maxim.kozlovsky@onstor.com>,
	"Jobi Ariyamannil" <jobi.ariyamannil@onstor.com>,
	"Andy Sharp" <andy.sharp@onstor.com>,
	"Ian Brown" <ian.brown@onstor.com>
Cc: "dl-Design Review" <dl-designreview@onstor.com>,
	"Brian Stark" <brian.stark@onstor.com>,
	"Warren Gale" <warren.gale@onstor.com>

You'll have to try harder than that.  Jobi has to restart his SSC
daemons because he's actually trying to use his cheetah as a filer.
However, if you have no clients and only care about the ssc daemons,
well, that's another story...

-----Original Message-----
From: Maxim Kozlovsky
Sent: Friday, June 13, 2008 11:12 AM
To: Jobi Ariyamannil; Andy Sharp; Ian Brown
Cc: dl-Design Review; Brian Stark; Warren Gale
Subject: RE: Proposed design for new(ish) boot procedure for Cougar

Oh well. This must be a part of the conspiracy to make Chris give up her
Cheetah.

>-----Original Message-----
>From: Jobi Ariyamannil
>Sent: Friday, June 13, 2008 10:57 AM
>To: Maxim Kozlovsky; Andy Sharp; Ian Brown
>Cc: dl-Design Review; Brian Stark; Warren Gale
>Subject: RE: Proposed design for new(ish) boot procedure for Cougar
>
>This does not work on cheetah anymore.
>We need to manually restart a bunch of SSC daemons after resetting
>the fp.
>
>-----Original Message-----
>From: Maxim Kozlovsky
>Sent: Friday, June 13, 2008 9:28 AM
>To: Andy Sharp; Ian Brown
>Cc: dl-Design Review; Brian Stark; Warren Gale
>Subject: RE: Proposed design for new(ish) boot procedure for Cougar
>
>
>
>>-----Original Message-----
>>From: Andy Sharp
>>Sent: Thursday, June 12, 2008 8:29 PM
>>To: Ian Brown
>>Cc: dl-Design Review; Brian Stark; Warren Gale
>>Subject: Re: Proposed design for new(ish) boot procedure for Cougar
>>
>>On Thu, 12 Jun 2008 18:34:00 -0700 Ian Brown <ian.brown@onstor.com>
>>wrote:
>>
>>> In production, for the Cheetah, we have always rebooted the entire
>>> box.  There were some daemons that relied on boot up order, thus I'd
>>> guess that you would need to restart the daemons in phase 1 if
>>> you're going to just bounce an embedded core.
>>
>>That's good to know.  What little I know about Cheetah operation would
>>likely fall into the "Lore" category.
>>
>>Phase I is still rebooting the whole box.  Depending on the results of
>>testing, Phase II may never see the light of day. ~:^)
>[MK]
>
>There is no need to restart the daemons. During cheetah development the
>daemons which did care about fp/txrx/fc restarts learned to listen for
>slot/cpu up/down events and do the right thing. This used to work up to
>3.2; after that I had to give up my cheetah and can't testify on that
>account.
>
>>
>>
>>> Ian
>>>
>>> On Jun 12, 2008, at 6:24 PM, Andrew Sharp wrote:
>>>
>>>                        Cougar Boot Procedure Redesign
>>>                        ______________________________
>>>
>>> Problem
>>> =======
>>>
>>>     Booting takes far too long on Cougar, and in theory the embedded
>>>     nodes should be rebootable w/o rebooting Linux on the Sibyte
>>>     1125.
>>>
>>> Reasons:
>>>     1)    Image load from CF is intolerably slow
>>>     2)    After image load, Linux boot takes the longest but is the
>>>           least likely to need rebooting, resulting in an
>>>           unnecessary bottleneck.
>>>
>>> Solution
>>> ========
>>>
>>>     Redesign the boot flow to allow the embedded cores to be
>>>     independently booted if Linux is up.
>>>
>>> Proposal
>>> ========
>>>
>>>     Take a phased approach to implementing a redesigned boot
>>>     procedure:
>>>
>>> 	Phase I
>>> 	-------
>>> 	1)  Change SSC PROM to load and boot only Linux.
>>> 	2)  Change FP/TXRX PROM to write a magic cookie in a
>>> 	    predefined memory location indicating its readiness
>>> 	    for its image to be loaded.
>>> 	3)  Implement an early start Linux daemon that waits for these
>>> 	    boot magic cookies to be set by the embedded cores, loads
>>> 	    their images to the correct memory locations, and signals
>>> 	    to the FP/TXRX when finished.  The FP and TXRX could boot
>>> 	    while Linux completes its boot steps.
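
The Phase I handshake described above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the proposed implementation: the cookie values, the offsets, and the names `READY_COOKIE`, `LOADED_COOKIE`, and `service_core` are all hypothetical, and a `bytearray` stands in for the memory region shared between Linux and an embedded core.

```python
import struct

# Hypothetical values: the proposal does not specify cookies or addresses.
READY_COOKIE = 0xB007C0DE   # written by the core's PROM: "ready for my image"
LOADED_COOKIE = 0x10ADED00  # written by the Linux daemon: "image is in place"
COOKIE_OFFSET = 0           # assumed location of the cookie word
IMAGE_OFFSET = 16           # assumed location where the image is copied


def read_cookie(mem):
    """Read the 32-bit cookie word from the shared region."""
    return struct.unpack_from("<I", mem, COOKIE_OFFSET)[0]


def write_cookie(mem, value):
    """Write the 32-bit cookie word into the shared region."""
    struct.pack_into("<I", mem, COOKIE_OFFSET, value)


def service_core(mem, image):
    """One polling pass of the loader daemon for a single core.

    Returns True if the image was loaded, False if the core has not
    yet signalled readiness.
    """
    if read_cookie(mem) != READY_COOKIE:
        return False
    if IMAGE_OFFSET + len(image) > len(mem):
        raise ValueError("image does not fit in the shared region")
    mem[IMAGE_OFFSET:IMAGE_OFFSET + len(image)] = image
    write_cookie(mem, LOADED_COOKIE)  # signal the core that it can boot
    return True


# Simulated sequence for one core.
fp_mem = bytearray(64)
fp_image = b"FP-IMAGE"
first_pass = service_core(fp_mem, fp_image)   # PROM has not signalled yet
write_cookie(fp_mem, READY_COOKIE)            # PROM writes its cookie
second_pass = service_core(fp_mem, fp_image)  # daemon loads and signals
```

In this sketch the daemon never touches a core's memory until the ready cookie appears, which is what would let the FP and TXRX images load while Linux finishes its own boot steps.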
>>>
>>> 	Phase II
>>> 	--------
>>> 	1)  Through testing, determine what needs to be done to allow
>>> 	    FP/TXRX to be rebooted independently without disturbing
>>> 	    the Linux kernel and each other.  Current daemons that
>>> 	    communicate with FP/TXRX are not expected to be much
>>> 	    trouble since they had to handle this for Cheetah, although
>>> 	    this has not been extensively tested on Cheetah in the last
>>> 	    few releases.
>>>
>>> Expected Results
>>> ================
>>>
>>> Phase I
>>> -------
>>>
>>> Current boot time           Predicted Boot time        Predicted savings
>>> -----------------           -------------------        -----------------
>>> 2 minutes, 57 secs          1 minute, 43.7 secs        1 minute, 13.3 secs
>>>
>>> 41% reduction in boot time: current boot time* is 2:57, resulting boot
>>> time is estimated to be 1:43.7, or, a savings of 1:13.3, or, the new
>>> method would boot 1.7 times faster (2 times faster, or twice as fast,
>>> would be a 50% reduction in boot time).
>>>
>>> These estimates are based on a difference in image load time for the
>>> FP/TXRX of 86 seconds for the PROM, and 12.7 seconds for Linux (cold
>>> cache).
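
The estimate can be reproduced from the three measured figures quoted above (2:57 current boot, 86 seconds PROM image load, 12.7 seconds Linux image load). A minimal sketch, assuming the only change in Phase I is the image load time; the variable and function names are illustrative:

```python
def fmt(seconds):
    """Format a duration in seconds as m:ss.s."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes)}:{secs:04.1f}"


current = 2 * 60 + 57      # measured boot time, in seconds
prom_load = 86.0           # FP/TXRX image load time from the PROM
linux_load = 12.7          # FP/TXRX image load time from Linux (cold cache)

savings = prom_load - linux_load   # seconds saved per boot
predicted = current - savings      # estimated Phase I boot time
reduction = savings / current      # fraction of boot time removed
speedup = current / predicted      # how many times faster the new boot is

print("predicted boot time:", fmt(predicted))
print("savings:", fmt(savings))
print(f"reduction: {reduction:.0%}, speedup: {speedup:.1f}x")
```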
>>>
>>>
>>> Phase II
>>> --------
>>> If just rebooting one or both of the FP/TXRX nodes, boot time is
>>> estimated to be in the sub 10 second range.  This would substantially
>>> increase customer satisfaction and supportability, as well as
>>> resulting in a substantial increase in developer efficiency.
>>>
>>>
>>>
>>>
>>>
>>> * Boot time measured from when PROM code starts loading the first boot
>>> image to when nfxsh CLI is available.
>>>
